
[2/N][refactor] torchair deepseek mla backend refactor #2459

Merged
wangxiyuan merged 1 commit into vllm-project:main from linfeng-yuan:torchair_deepseek_modeling_refactore_01 on Aug 21, 2025

Conversation

@linfeng-yuan
Collaborator

linfeng-yuan commented Aug 20, 2025

What this PR does / why we need it?

This PR moves the current unified MLA backend into the torchair folder and removes the torchair-related code from attention/mla_v1.py (1.3k -> 0.9k lines).

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran eager mode with the MLA backend, and torchair mode with the code before #2445.
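
For readers skimming the refactor, here is a minimal sketch of the resulting module split. The file layout follows the PR description, but the class and method names are illustrative assumptions, not the PR's exact code:

```python
# Hypothetical sketch of the module split described above; file paths match
# the PR description, but class/method names are assumptions.

# vllm_ascend/attention/mla_v1.py (after this PR): eager-mode MLA only.
class AscendMLABackend:
    """Eager-mode multi-head latent attention (MLA) backend."""

    @staticmethod
    def get_name() -> str:
        return "ASCEND_MLA"


# vllm_ascend/torchair/torchair_mla.py (new in this PR): graph-mode MLA,
# moved out of attention/mla_v1.py so the eager backend stays lean.
class AscendMLATorchairBackend:
    """TorchAir graph-mode MLA backend for DeepSeek models on Ascend NPUs."""

    @staticmethod
    def get_name() -> str:
        return "ASCEND_MLA_TORCHAIR"
```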

Contributor

gemini-code-assist (bot) left a comment


Code Review

This pull request refactors the attention backend selection logic and introduces a new TorchAir MLA backend for DeepSeek models on Ascend NPUs. The refactoring in platform.py correctly handles the selection of different attention backends. The new implementation in vllm_ascend/torchair/torchair_mla.py adds the TorchAir-based MLA backend. I've found a critical issue in the decode path of this new implementation that needs to be addressed.
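
To make the selection logic the review refers to concrete, here is a minimal sketch of how platform.py might dispatch between the eager and TorchAir MLA backends. The function name, flags, and dotted class paths are assumptions for illustration, not the PR's exact code:

```python
# A hedged sketch of attention backend selection; names are assumptions.

def get_attn_backend_cls(use_mla: bool, use_torchair: bool) -> str:
    """Return the dotted path of the attention backend class to load."""
    if use_mla and use_torchair:
        # Graph-mode path: the new TorchAir MLA backend added by this PR.
        return "vllm_ascend.torchair.torchair_mla.AscendMLATorchairBackend"
    if use_mla:
        # Eager path: the slimmed-down backend left in attention/mla_v1.py.
        return "vllm_ascend.attention.mla_v1.AscendMLABackend"
    # Non-MLA models fall back to the default Ascend attention backend.
    return "vllm_ascend.attention.attention_v1.AscendAttentionBackend"


if __name__ == "__main__":
    # Example: DeepSeek with torchair graph mode enabled.
    print(get_attn_backend_cls(use_mla=True, use_torchair=True))
```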

Comment thread: vllm_ascend/torchair/torchair_mla.py
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a clear commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@linfeng-yuan linfeng-yuan force-pushed the torchair_deepseek_modeling_refactore_01 branch from 9d75ec4 to 5c278a8 on August 20, 2025 15:46
@linfeng-yuan linfeng-yuan force-pushed the torchair_deepseek_modeling_refactore_01 branch 2 times, most recently from 6ad2f4d to 0b5b5b2 on August 20, 2025 16:59
@codecov

codecov Bot commented Aug 20, 2025

Codecov Report

❌ Patch coverage is 79.62264% with 216 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.56%. Comparing base (2bb7e55) to head (0b5b5b2).
⚠️ Report is 24 commits behind head on main.

Files with missing lines               Patch %   Missing lines
vllm_ascend/torchair/torchair_mla.py    61.70%    211 ⚠️
vllm_ascend/attention/mla_v1.py         85.29%      5 ⚠️

❌ Your patch check has failed because the patch coverage (79.62%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2459      +/-   ##
==========================================
+ Coverage   76.18%   77.56%   +1.37%     
==========================================
  Files         120      130      +10     
  Lines       13532    17149    +3617     
==========================================
+ Hits        10310    13302    +2992     
- Misses       3222     3847     +625     
Flag Coverage Δ
unittests 77.56% <79.62%> (+1.37%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@github-actions
Contributor

This pull request has conflicts; please resolve them before we can evaluate it.


@linfeng-yuan linfeng-yuan force-pushed the torchair_deepseek_modeling_refactore_01 branch 2 times, most recently from 582e3b8 to 9b0233d on August 21, 2025 01:59
@linfeng-yuan linfeng-yuan force-pushed the torchair_deepseek_modeling_refactore_01 branch 3 times, most recently from afd1b79 to 2d4e437 on August 21, 2025 04:06
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the torchair_deepseek_modeling_refactore_01 branch from 2d4e437 to d8671e8 on August 21, 2025 04:51
@wangxiyuan
Collaborator

CI failure doesn't relate to this PR.

@wangxiyuan wangxiyuan merged commit 0ca3f48 into vllm-project:main Aug 21, 2025
18 of 20 checks passed
wangxiaoteng888 pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Sep 25, 2025
…2459)

### What this PR does / why we need it?
This PR moves the current unified MLA backend into the torchair folder and
removes the torchair-related code from attention/mla_v1.py (1.3k -> 0.9k lines).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran eager mode with the MLA backend, and torchair mode with the code before
[2445](vllm-project#2445)


- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@f571ff8

Signed-off-by: linfeng-yuan <1102311262@qq.com>
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
linfeng-yuan pushed a commit that referenced this pull request May 9, 2026
- ✅ **Review Quality:**
He has completed [50+
reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+reviewed-by%3Alinfeng-yuan)
since April 2025, covering graph mode, MoE, quantization, model support,
and performance-related changes.

In addition to regular review work, he has also participated in complex
feature development and review, such as
[#6670](#6670) (MoE
MXFP8 quantization), where he helped with A5 MXFP8 integration,
compatibility cleanup, dispatch updates, and implementation fixes.

- ✅ **Sustained Contributions:**
He has [60+ merged
PRs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Amerged+author%3Alinfeng-yuan)
since April 2025, with continuous activity across major release cycles.

- ✅ **Quality Contributions:**

  **Torchair Graph Mode & Wide-EP / MoE — Feature Owner (2025 Q2~Q4):**
He was the Feature Owner for DeepSeek high-throughput inference under
torchair graph mode and the Wide-EP project. He drove graph mode
performance optimization
([#731](#731)), landed
super-kernel fusion for quantized DSR1
([#3485](#3485)), and
added initial MoE support for Model Runner v2
([#7922](#7922)).

  **Ascend950 (A5) — Feature Owner:**
He authored the [RFC roadmap
(#7157)](#7157) for A5
support, landed initial build support
([#7151](#7151)),
co-authored MXFP8 and MXFP4 quantization support for A5
([#6670](#6670),
[#7877](#7877)), and
fixed the MXFP8 scale normalization issue that unblocked A5 quantized
inference
([#7573](#7573)).

  **DeepSeek Low-Latency & Post-Processing:**
He improved DSv3.2 performance by eliminating HD synchronization
([#4805](#4805)),
improved rejection sampler performance and eliminated D2H sync in
TopKTopPSampler
([#4154](#4154)), and
added a penalty-related Triton kernel for sampling performance
([#7794](#7794)).

- ✅ **Community Involvement:**
He led a 2-part torchair modeling refactor
([#2384](#2384),
[#2459](#2459)) and
deleted ~2K lines of redundant DeepSeek modeling code as upstream
absorbed the changes
([#2849](#2849)). He
also replaced scattered business kwargs with typed request objects
across MoE stage boundaries
([#7024](#7024)).

Since March 2026, he has taken part in issue triage and user support,
responding to [30+
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue+commenter%3Alinfeng-yuan+updated%3A%3E2026-03-01)
covering graph mode failures, quantization accuracy regressions, MoE
deployment problems, and multi-node communication issues.

- vLLM version: v0.19.1
- vLLM main:
vllm-project/vllm@4d51588

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 10, 2026
SOMEONEUNSEEN pushed a commit to SOMEONEUNSEEN/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 12, 2026